Synthetic Word Parsing Improves Chinese Word Segmentation
Abstract
We present a novel solution to improve the performance of Chinese word segmentation (CWS) using a synthetic word parser. The parser analyses the internal structure of words, and attempts to convert out-of-vocabulary words (OOVs) into in-vocabulary fine-grained sub-words. We propose a pipeline CWS system that first predicts this fine-grained segmentation, then chunks the output to reconstruct the original word segmentation standard. We achieve competitive results on the PKU and MSR datasets, with substantial improvements in OOV recall.
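The second stage of the pipeline described above, chunking fine-grained sub-words back into words of the target standard, can be illustrated with a minimal sketch. The B/I tag scheme, the function name `chunk_subwords`, and the example sentence are assumptions made for illustration; the abstract does not say how the chunker is implemented or which labels it uses.

```python
from typing import List


def chunk_subwords(subwords: List[str], tags: List[str]) -> List[str]:
    """Merge fine-grained sub-words into words.

    Hypothetical tag scheme (not taken from the paper):
      "B" starts a new word, "I" attaches the sub-word to the previous word.
    """
    if len(subwords) != len(tags):
        raise ValueError("subwords and tags must have the same length")

    words: List[str] = []
    for piece, tag in zip(subwords, tags):
        if tag not in ("B", "I"):
            raise ValueError(f"unknown tag: {tag!r}")
        if tag == "B" or not words:
            # Start a new word (also covers a stray leading "I").
            words.append(piece)
        else:
            # Continue the current word.
            words[-1] += piece
    return words


# Stage 1 (assumed output): a fine-grained segmentation into sub-words.
fine_grained = ["总", "工程师", "来", "了"]
# Stage 2 (assumed output): chunk tags predicted over those sub-words.
tags = ["B", "I", "B", "B"]

print(chunk_subwords(fine_grained, tags))
```

In this toy input the first two sub-words are joined into a single word, while the remaining sub-words stay separate. A real system would predict the tags with a trained sequence labeller rather than supplying them by hand.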
Similar resources
Parsing Chinese Synthetic Words with a Character-based Dependency Model
Synthetic word analysis is a potentially important but relatively unexplored problem in Chinese natural language processing. Two issues with the conventional pipeline methods involving word segmentation are (1) the lack of a common segmentation standard and (2) the poor segmentation performance on OOV words. These issues may be circumvented if we adopt the view of character-based parsing, provi...
Word-Context Character Embeddings for Chinese Word Segmentation
Neural parsers have benefited from automatically labeled data via dependency-context word embeddings. We investigate training character embeddings on a word-based context in a similar way, showing that the simple method significantly improves state-of-the-art neural word segmentation models, beating tri-training baselines for leveraging auto-segmented data.
An Efficient Chinese Parsing Algorithm for Computer-Assisted Language Learning
Instructional grammar is often used in Computer-assisted Language Learning (CALL), and grammatical error detection is an important feature. However, it is not an easy task for the Chinese language. There is no delimiter separating consecutive words in Chinese sentences. Word segmentation is a process in which proper word boundaries are identified. Before syntactic parsing of a Chinese sentence, w...
Character-Level Dependency Model for Joint Word Segmentation, POS Tagging, and Dependency Parsing in Chinese
Recent work on joint word segmentation, POS (Part Of Speech) tagging, and dependency parsing in Chinese has two key problems: the first is that character-based word segmentation and word-based dependency parsing were not combined well in the transition-based framework, and the second is that the joint model suffers from insufficient annotated corpora. In order to resolve the first ...
Joint Chinese Word Segmentation, POS Tagging and Parsing
In this paper, we propose a novel decoding algorithm for discriminative joint Chinese word segmentation, part-of-speech (POS) tagging, and parsing. Previous work often used a pipeline method – Chinese word segmentation followed by POS tagging and parsing, which suffers from error propagation and is unable to leverage information in later modules for earlier components. In our approach, we train...
Publication date: 2015